
    Identification of Web Spam through Clustering of Website Structures

    Spam websites are domains whose owners are not interested in using them as gates for their activities; instead, they are parked to be sold on the secondary market of web domains. To turn the cost of the annual registration fees into an opportunity for revenue, spam websites most often host a large number of ads in the hope that someone who lands on the site by chance will click on them. Since parking has become a widespread activity, a large number of specialized companies have emerged, making parking a straightforward task that simply requires setting the domain's name servers appropriately. Although parking is a legal activity, spam websites have a deep negative impact on the information quality of the web and can significantly deteriorate the performance of most web mining tools. For example, these websites can influence search engine results or introduce an extra burden for crawling systems. In addition, spam websites represent a cost for ad bidders, who are obliged to pay for impressions or clicks that have a negligible probability of producing revenue. In this paper, we experimentally show that spam websites hosted by the same service provider tend to have a similar look and feel. Exploiting this structural similarity, we address the problem of the automatic identification of spam websites. In addition, we use the outcome of the classification to compile the list of name servers used by spam websites, so that such sites can be discarded right after the first DNS query, before the first connection is made. A dump of our dataset (including web pages and meta information) and the corresponding manual classification is freely available upon request.
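
    A minimal sketch of the underlying idea of grouping sites by structural look-and-feel, not the paper's actual features or algorithm: each site is fingerprinted by shingles over its HTML tag sequence, and sites whose Jaccard similarity exceeds a (hypothetical) threshold are linked into the same cluster.

        from itertools import combinations
        from html.parser import HTMLParser

        class TagSequenceParser(HTMLParser):
            """Collects the sequence of opening tags as a rough structural fingerprint."""
            def __init__(self):
                super().__init__()
                self.tags = []
            def handle_starttag(self, tag, attrs):
                self.tags.append(tag)

        def shingles(tags, k=4):
            """k-grams over the tag sequence; captures page layout rather than content."""
            return {tuple(tags[i:i + k]) for i in range(len(tags) - k + 1)}

        def jaccard(a, b):
            return len(a & b) / len(a | b) if a | b else 0.0

        def cluster_by_structure(pages, threshold=0.7):
            """pages: dict site -> HTML string (hypothetical input format).
            Links sites whose structural similarity exceeds the threshold and
            returns connected components as clusters (simple union-find)."""
            fingerprints = {}
            for site, html in pages.items():
                parser = TagSequenceParser()
                parser.feed(html)
                fingerprints[site] = shingles(parser.tags)

            parent = {site: site for site in pages}
            def find(x):
                while parent[x] != x:
                    parent[x] = parent[parent[x]]
                    x = parent[x]
                return x
            def union(x, y):
                parent[find(x)] = find(y)

            for a, b in combinations(pages, 2):
                if jaccard(fingerprints[a], fingerprints[b]) >= threshold:
                    union(a, b)

            clusters = {}
            for site in pages:
                clusters.setdefault(find(site), []).append(site)
            return list(clusters.values())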

    Dynamic User-Defined Similarity Searching in Semi-Structured Text Retrieval

    Modern text retrieval systems often provide a similarity search utility that allows the user to efficiently find a fixed number k of documents in the data set that are most similar to a given query (here a query is either a simple sequence of keywords or the identifier of a full document, found in previous searches, that is considered of interest). We consider the case of a textual database made of semi-structured documents. For example, in a corpus of bibliographic records any record may be structured into three fields: title, authors and abstract, where each field is an unstructured free text. Each field, in turn, is modelled with a specific vector space. The problem becomes more complex when we also allow each such vector space to have an associated user-defined dynamic weight that influences its contribution to the overall dynamic aggregated and weighted similarity. This dynamic problem was tackled in a recent paper by Singitham et al. in VLDB 2004. Their proposed solution, which we take as a baseline, is a variant of the cluster-pruning technique that has the potential for scaling to very large corpora of documents and is far more efficient than naive exhaustive search. We devise an alternative way of embedding weights in the data structure, coupled with a non-trivial application of a clustering algorithm based on the furthest-point-first heuristic for the metric k-center problem. The validity of our approach is demonstrated experimentally: we significantly improve the trade-off between query time and output quality with respect to the baseline method of VLDB 2004, and also with respect to a novel method by Chierichetti et al. to appear in ACM PODS 2007. We also speed up the pre-processing time by a factor of at least thirty.
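
    As a minimal sketch of the weighted aggregated similarity described above (not the cluster-pruning index itself), the following assumes each field, such as title, authors or abstract, has already been mapped to a sparse term-weight vector, and combines per-field cosine similarities with the user-defined dynamic weights; the exhaustive top-k ranking shown here is the naive baseline the paper improves on.

        import math

        def cosine(u, v):
            """Cosine similarity between two sparse vectors given as dicts term -> weight."""
            dot = sum(w * v.get(t, 0.0) for t, w in u.items())
            nu = math.sqrt(sum(w * w for w in u.values()))
            nv = math.sqrt(sum(w * w for w in v.values()))
            return dot / (nu * nv) if nu and nv else 0.0

        def aggregated_similarity(query_fields, doc_fields, weights):
            """Weighted sum of per-field cosine similarities.
            query_fields/doc_fields: dict field -> sparse vector; weights: dict field -> user weight."""
            total = sum(weights.values())
            return sum(weights[f] * cosine(query_fields.get(f, {}), doc_fields.get(f, {}))
                       for f in weights) / total if total else 0.0

        def top_k(query_fields, corpus, weights, k=10):
            """Exhaustive baseline: rank every document by the aggregated similarity."""
            scored = [(aggregated_similarity(query_fields, d, weights), doc_id)
                      for doc_id, d in corpus.items()]
            return sorted(scored, reverse=True)[:k]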

    Extraction and classification of dense communities in the Web

    The World Wide Web (WWW) is rapidly becoming important for society as a medium for sharing data, information and services, and there is a growing interest in tools for understanding collective behaviors and emerging phenomena in the WWW. In this paper we focus on the problem of searching and classifying communities in the web. Loosely speaking, a community is a group of pages related to a common interest. More formally, communities have been associated in the computer science literature with the existence of a locally dense sub-graph of the web-graph (where web pages are nodes and hyper-links are arcs of the web-graph). The core of our contribution is a new scalable algorithm for finding relatively dense subgraphs in massive graphs. We apply our algorithm to web-graphs built on three publicly available large crawls of the web (with raw sizes up to 120M nodes and 1G arcs). The effectiveness of our algorithm in finding dense subgraphs is demonstrated experimentally by embedding artificial communities in the web-graph and counting how many of them are blindly found. Effectiveness increases with the size and density of the communities: it is close to 100% for dense communities of a hundred nodes or more. Moreover, it is still about 80% even for small communities of twenty nodes with 50% of the arcs present. We complete our Community Watch system by clustering the communities found in the web-graph into homogeneous groups by topic and labelling each group with representative keywords.
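
    The abstract does not spell out the algorithm itself; as a reference point for what "finding a relatively dense subgraph" means, here is a sketch of the classic greedy peeling heuristic (repeatedly remove a minimum-degree node and keep the densest intermediate subgraph), under the assumption of an in-memory adjacency-set representation. It is an illustration, not the paper's scalable method.

        def densest_subgraph_peeling(adj):
            """adj: dict node -> set of neighbours (undirected graph).
            Greedy peeling: repeatedly remove a minimum-degree node, keeping track of
            the subgraph with the best edge density (edges / nodes) seen along the way."""
            adj = {u: set(vs) for u, vs in adj.items()}   # work on a copy

            def density(g):
                edges = sum(len(vs) for vs in g.values()) / 2
                return edges / len(g) if g else 0.0

            best_nodes, best_density = set(adj), density(adj)
            while len(adj) > 1:
                u = min(adj, key=lambda x: len(adj[x]))   # minimum-degree node
                for v in adj[u]:
                    adj[v].discard(u)
                del adj[u]
                d = density(adj)
                if d > best_density:
                    best_density, best_nodes = d, set(adj)
            return best_nodes, best_density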

    A Scalable Algorithm for Metric High-Quality Clustering in Information Retrieval Tasks

    We consider the problem of efficiently finding a high-quality k-clustering of n points in a (possibly discrete) metric space. Many methods are known when the points are vectors in a real vector space and the distance function is a standard geometric distance such as L1, L2 (Euclidean) or L2^2 (squared Euclidean distance). In such cases efficiency is often sought via sophisticated multidimensional search structures for speeding up nearest-neighbor queries (e.g. variants of kd-trees). Such techniques usually work well in spaces of moderately high dimension (say, up to 6 or 8). Our target is a scenario in which either the metric space cannot be mapped into a vector space or, if this mapping is possible, the dimension of such a space is so high as to rule out the use of the above-mentioned techniques. This setting is rather typical in Information Retrieval applications. We augment the well-known furthest-point-first algorithm for k-center clustering in metric spaces with a filtering step based on the triangle inequality, and we compare this algorithm with some recent fast variants of the classical k-means iterative algorithm augmented with an analogous filtering scheme. We extensively tested the two solutions on synthetic geometric data and real data from Information Retrieval applications. The main conclusion we draw is that our modified furthest-point-first method attains solutions of better or comparable quality within a fraction of the time used by the fast k-means algorithm. Thus our algorithm is valuable when either real-time constraints or the large amount of data highlight the poor scalability of traditional clustering methods.
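
    A minimal sketch of the furthest-point-first k-center algorithm with the triangle-inequality filter mentioned above: when a new center c is added, a point p currently assigned to center c_p with distance d(p, c_p) cannot be closer to c if d(c_p, c) >= 2*d(p, c_p), so that distance computation can be skipped. The data layout and plain-Python style are assumptions; the paper's implementation details may differ.

        def fpf_kcenter(points, k, dist):
            """points: list of objects; dist: metric distance function.
            Returns the k center indices and the assignment of each point to a center."""
            n = len(points)
            centers = [0]                                        # arbitrary first center
            d_to_center = [dist(points[i], points[0]) for i in range(n)]
            assign = [0] * n                                     # index of the assigned center

            for _ in range(1, k):
                # next center: the point furthest from its current center
                c = max(range(n), key=lambda i: d_to_center[i])
                centers.append(c)
                # distances from the new center to the previous centers (for the filter)
                d_cc = {j: dist(points[c], points[j]) for j in centers[:-1]}
                for i in range(n):
                    # triangle inequality: if d(c_i, c) >= 2*d(p_i, c_i), c cannot be closer
                    if d_cc[assign[i]] >= 2 * d_to_center[i]:
                        continue
                    d = dist(points[i], points[c])
                    if d < d_to_center[i]:
                        d_to_center[i] = d
                        assign[i] = c
            return centers, assign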

    Lumbricus webis: a parallel and distributed crawling architecture for the Italian web

    Web crawlers have become popular tools for gathering large portions of the web that can be used for many tasks, from statistics to structural analysis of the web. Due to the amount of data and the heterogeneity of the tasks to manage, it is essential for crawlers to have a modular and distributed architecture. In this paper we describe Lumbricus webis (L.webis for short), a modular crawling infrastructure built to mine data from the ccTLD .it web domain and portions of the web reachable from this domain. Its purpose is to support the gathering of advanced statistics and advanced analytic tools on the content of the Italian Web. This paper describes the architectural features of L.webis and its performance. L.webis can currently download a mid-sized ccTLD such as ".it" in about one week.
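
    For illustration only, the sketch below shows the basic fetch-parse-enqueue cycle of a crawler restricted to the .it ccTLD; L.webis distributes and modularizes this cycle across several components, which is not reproduced here.

        import urllib.request
        import urllib.parse
        from collections import deque
        from html.parser import HTMLParser

        class LinkExtractor(HTMLParser):
            """Collects href targets of anchor tags."""
            def __init__(self):
                super().__init__()
                self.links = []
            def handle_starttag(self, tag, attrs):
                if tag == "a":
                    for name, value in attrs:
                        if name == "href" and value:
                            self.links.append(value)

        def crawl_it_domain(seeds, max_pages=100):
            """Single-threaded frontier loop restricted to hosts ending in .it."""
            frontier, seen, pages = deque(seeds), set(seeds), {}
            while frontier and len(pages) < max_pages:
                url = frontier.popleft()
                try:
                    with urllib.request.urlopen(url, timeout=10) as resp:
                        html = resp.read().decode("utf-8", errors="replace")
                except Exception:
                    continue                       # unreachable or non-text resource
                pages[url] = html
                parser = LinkExtractor()
                parser.feed(html)
                for link in parser.links:
                    absolute = urllib.parse.urljoin(url, link)
                    host = urllib.parse.urlparse(absolute).hostname or ""
                    if host.endswith(".it") and absolute not in seen:
                        seen.add(absolute)
                        frontier.append(absolute)
            return pages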

    Cluster Generation and Cluster Labelling for Web Snippets: A Fast and Accurate Hierarchical Solution

    This paper describes Armil, a meta-search engine that groups into disjoint labelled clusters the Web snippets returned by auxiliary search engines. The cluster labels generated by Armil provide the user with a compact guide to assessing the relevance of each cluster to her information need. Striking the right balance between running time and cluster well-formedness was a key point in the design of our system. Both the clustering and the labelling tasks are performed on the fly by processing only the snippets provided by the auxiliary search engines, and use no external sources of knowledge. Clustering is performed by means of a fast version of the furthest-point-first algorithm for metric k-center clustering. Cluster labelling is achieved by combining intra-cluster and inter-cluster term extraction based on a variant of the information gain measure. We have tested the clustering effectiveness of Armil against Vivisimo, the de facto industrial standard in Web snippet clustering, using as benchmark a comprehensive set of snippets obtained from the Open Directory Project hierarchy. According to two widely accepted 'external' metrics of clustering quality, Armil achieves better performance levels by 10%. We also report the results of a thorough user evaluation of both the clustering and the cluster labelling algorithms. On a standard 1GHz machine, Armil performs clustering and labelling altogether in less than one second.
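
    A minimal sketch of the labelling idea, assuming clusters of snippets (as token sets) are already available: each candidate term is scored by an information-gain-style measure contrasting its distribution inside a cluster with the rest of the collection, and the top-scoring terms become the label. This illustrates the general measure, not Armil's exact intra-/inter-cluster variant.

        import math

        def entropy(p):
            return -sum(x * math.log2(x) for x in p if x > 0)

        def information_gain(term, cluster_docs, other_docs):
            """Reduction in class entropy (cluster vs. rest) obtained by
            splitting the collection on presence/absence of the term."""
            docs = [(d, 1) for d in cluster_docs] + [(d, 0) for d in other_docs]
            n = len(docs)
            p_cluster = len(cluster_docs) / n
            base = entropy([p_cluster, 1 - p_cluster])

            with_term = [c for d, c in docs if term in d]
            without_term = [c for d, c in docs if term not in d]
            cond = 0.0
            for subset in (with_term, without_term):
                if subset:
                    p1 = sum(subset) / len(subset)
                    cond += len(subset) / n * entropy([p1, 1 - p1])
            return base - cond

        def label_cluster(cluster_docs, other_docs, top=3):
            """cluster_docs/other_docs: lists of token sets (hypothetical representation)."""
            vocabulary = set().union(*cluster_docs)
            scored = sorted(vocabulary,
                            key=lambda t: information_gain(t, cluster_docs, other_docs),
                            reverse=True)
            return scored[:top]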

    A Fast and Accurate Heuristic for the Single Individual SNP Haplotyping Problem with Many Gaps, High Reading Error Rate and Low Coverage

    Single nucleotide polymorphism (SNP) is the most frequent form of DNA variation. The set of SNPs present in a chromosome (called the haplotype) is of interest for a wide range of applications in molecular biology and biomedicine, including diagnostics and medical therapy. In this paper we propose a new heuristic method for the problem of haplotype reconstruction for (portions of) a pair of homologous human chromosomes from a single individual (SIH). The problem is well known in the literature, and exact algorithms have been proposed for the case when no (or few) gaps are allowed in the input fragments. These algorithms, though exact and of polynomial complexity, are slow in practice, so fast heuristics have been proposed. In this paper we describe a new heuristic method that is able to tackle the case of many gapped fragments and retains its effectiveness even when the input fragments have a high reading-error rate (up to 20%) and low coverage (as low as 3). We test our method on real data from the HapMap Project.
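
    The heuristic itself is not detailed in the abstract; as a hedged illustration of the single-individual haplotyping setting, the sketch below greedily assigns each fragment (a dict from SNP positions to 0/1 alleles, with gaps simply absent) to one of two haplotype classes by agreement and then calls each haplotype by per-position majority vote. This is a common baseline, not the paper's method.

        from collections import defaultdict

        def agreement(fragment, counts):
            """Net agreement of a fragment with the allele counts of one class."""
            score = 0
            for pos, allele in fragment.items():
                c = counts.get(pos)
                if c:
                    score += c[allele] - c[1 - allele]
            return score

        def greedy_bipartition(fragments):
            """Assign each fragment to the haplotype class it agrees with most,
            then call each haplotype by per-position majority vote."""
            counts = [defaultdict(lambda: [0, 0]), defaultdict(lambda: [0, 0])]
            classes = [[], []]
            for frag in fragments:
                side = 0 if agreement(frag, counts[0]) >= agreement(frag, counts[1]) else 1
                classes[side].append(frag)
                for pos, allele in frag.items():
                    counts[side][pos][allele] += 1

            haplotypes = []
            for side in (0, 1):
                hap = {pos: (0 if c[0] >= c[1] else 1) for pos, c in counts[side].items()}
                haplotypes.append(hap)
            return haplotypes, classes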

    Packet Classification via Improved Space Decomposition Techniques

    Packet classification is a common task in modern Internet routers. The goal is to classify packets into "classes" or "flows" according to a ruleset that looks at multiple fields of each packet. Differentiated actions can then be applied to the traffic depending on the result of the classification. Even though rulesets can be expressed in a relatively compact way by using high-level languages, the resulting decision trees can partition the search space (the set of possible attribute values) into a potentially very large number of regions. This calls for methods that scale to such large problem sizes, though the only scalable proposal in the literature so far is the one based on a Fat Inverted Segment Tree [1]. In this paper we propose a new geometric technique called G-filter for multi-dimensional packet classification. G-filter is based on an improved space decomposition technique. In addition to a theoretical analysis bounding the time complexity of classification with G-filter and showing that its space usage is slightly super-linear in the number of rules, we provide thorough experiments showing that the constants involved are extremely small over a wide range of problem sizes, and that G-filter improves on the best results in the literature for large problem sizes while remaining competitive for small sizes as well.
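
    The G-filter structure is not described in the abstract; as a hedged illustration of classification by recursive space decomposition, the sketch below builds a quadtree-like tree over two hypothetical rule fields, stores at each node the rules that fully cover its region, pushes the rest into quadrants, and classifies a packet by walking down the tree and keeping the highest-priority matching rule.

        # A rule is a dict like {"x": (lo, hi), "y": (lo, hi), "prio": 1, "action": "drop"}
        # with closed integer ranges over two packet fields (hypothetical format).

        class Node:
            def __init__(self, region, rules, children):
                self.region = region      # (x0, x1, y0, y1), closed integer ranges
                self.rules = rules        # rules stored at this node
                self.children = children  # quadrant sub-nodes (empty for a leaf)

        def covers(rule, region):
            x0, x1, y0, y1 = region
            return (rule["x"][0] <= x0 and rule["x"][1] >= x1 and
                    rule["y"][0] <= y0 and rule["y"][1] >= y1)

        def intersects(rule, region):
            x0, x1, y0, y1 = region
            return (rule["x"][0] <= x1 and rule["x"][1] >= x0 and
                    rule["y"][0] <= y1 and rule["y"][1] >= y0)

        def build(rules, region, leaf_size=4, depth=0, max_depth=12):
            """Rules covering the whole region stay at this node; the rest are
            pushed into the four quadrants of the region."""
            live = [r for r in rules if intersects(r, region)]
            covering = [r for r in live if covers(r, region)]
            partial = [r for r in live if not covers(r, region)]
            if len(partial) <= leaf_size or depth >= max_depth:
                return Node(region, covering + partial, [])
            x0, x1, y0, y1 = region
            mx, my = (x0 + x1) // 2, (y0 + y1) // 2
            quads = [(x0, mx, y0, my), (mx + 1, x1, y0, my),
                     (x0, mx, my + 1, y1), (mx + 1, x1, my + 1, y1)]
            children = [build(partial, q, leaf_size, depth + 1, max_depth)
                        for q in quads if q[0] <= q[1] and q[2] <= q[3]]
            return Node(region, covering, children)

        def classify(node, packet):
            """Walk down the decomposition, keeping the highest-priority matching rule."""
            x, y = packet
            best = None
            while node is not None:
                for r in node.rules:
                    if r["x"][0] <= x <= r["x"][1] and r["y"][0] <= y <= r["y"][1]:
                        if best is None or r["prio"] > best["prio"]:
                            best = r
                node = next((c for c in node.children
                             if c.region[0] <= x <= c.region[1]
                             and c.region[2] <= y <= c.region[3]), None)
            return best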

    A Framework to Evaluate Information Quality in Public Administration Website

    This paper presents a framework aimed at assessing the capacity of Public Administration bodies (PA) to offer good-quality information and services on their web portals. Our framework is based on the extraction of “.it” domain names registered by Italian public institutions and the subsequent analysis of their websites. The analysis involves automatically gathering the web pages of PA portals by means of web crawling and assessing the quality of their online information services. This assessment is carried out by verifying their compliance with current legislation on the basis of the criteria established in government guidelines [1]. This approach provides an ongoing monitoring process of PA websites that can contribute to the improvement of their overall quality. Moreover, our approach can also hopefully be of benefit to local governments in other countries. Available at: https://aisel.aisnet.org/pajais/vol5/iss3/3

    On the Benefits of Keyword Spreading in Sponsored Search Auctions: An Experimental Analysis

    Sellers of goods or services wishing to participate in sponsored search auctions must define a pool of keywords that are matched on-line to the queries submitted by users to a search engine. Sellers must also define the value of their bid to the search engine for showing their advertisements in case of a query-keyword match. In order to optimize its revenue, a seller might decide to substitute a high-cost keyword, likely to be the object of intense competition, with a set of related keywords that collectively have lower cost while capturing an equivalent volume of user clicks. This technique is called keyword spreading and has recently attracted the attention of several researchers in the area of sponsored search auctions. In this paper we describe an experimental benchmark that, through large-scale realistic simulations, allows us to pinpoint the potential benefits and drawbacks of keyword spreading for the players using this technique, for those not using it, and for the search engine itself. Experimental results reveal that keyword spreading is generally convenient (or at least non-damaging) to all parties involved.
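
    A small illustrative calculation of the keyword-spreading decision, under the simplifying assumption that each related keyword comes with an estimated click volume and cost-per-click: related keywords are added cheapest-first until they match the head keyword's click volume, and total spend is compared. Both the data and the greedy rule are assumptions, not the paper's simulation model.

        def spread_keyword(head, candidates):
            """head: (clicks, cost_per_click) of the expensive keyword.
            candidates: list of (keyword, clicks, cost_per_click) for related keywords.
            Greedily picks the cheapest candidates until their combined click volume
            matches the head keyword, then compares total spend."""
            target_clicks, head_cpc = head
            head_cost = target_clicks * head_cpc

            chosen, clicks, cost = [], 0, 0.0
            for kw, kw_clicks, kw_cpc in sorted(candidates, key=lambda c: c[2]):
                if clicks >= target_clicks:
                    break
                chosen.append(kw)
                clicks += kw_clicks
                cost += kw_clicks * kw_cpc

            worthwhile = clicks >= target_clicks and cost < head_cost
            return {"use_spreading": worthwhile, "keywords": chosen,
                    "spread_cost": cost, "head_cost": head_cost}

        # Example: a 1.50-per-click head keyword vs. three cheaper related keywords.
        print(spread_keyword((1000, 1.50),
                             [("kw_a", 400, 0.40), ("kw_b", 500, 0.55), ("kw_c", 300, 0.90)]))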